Machine Learning Analysis Pipeline
EDR: Dataset Loading & Preprocessing
EDR – Train/Test Overview
• Train shape: (9561, 20) | Test shape: (818, 20)
• Total train samples: 9,561 | Total test samples: 818
• Number of features: 18
• Target column: 'label'
• Missing values (train): 0 | (test): 0
EDR – Train Class Distribution
• 0: 8,704
• 1: 857
• Class balance (minority/majority): 9.8460%
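The reported class balance follows directly from the two class counts. A minimal check, with variable names chosen here purely for illustration:

```python
# Class counts from the train class distribution above.
class_counts = {0: 8704, 1: 857}

minority = min(class_counts.values())
majority = max(class_counts.values())

# "Class balance" as reported: minority count divided by majority count.
balance_ratio = minority / majority

# A different quantity: the minority share of all training samples.
minority_share = minority / sum(class_counts.values())

print(f"minority/majority: {balance_ratio:.4%}")
print(f"minority share of train set: {minority_share:.4%}")
```

The first figure matches the 9.8460% above. The second is lower because its denominator is the full 9,561 training samples rather than the majority class alone.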
EDR – Feature Preparation
• Target encoding: {0: 0, 1: 1}
• Data preprocessing: Infinite values handled, missing values filled with train medians
• Feature scaling: StandardScaler (fit on train, applied to test)
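The preparation steps listed above could look roughly like the sketch below. It assumes "handled" means infinite values are converted to missing values before imputation. The report does not say how they were handled, and `prepare_features` is a hypothetical helper name.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def prepare_features(X_train, X_test):
    """Replace +/-inf with NaN, fill NaN with train medians, then standardize."""
    X_train = np.asarray(X_train, dtype=float).copy()
    X_test = np.asarray(X_test, dtype=float).copy()

    # Assumption: infinities are treated as missing so they are imputed too.
    X_train[~np.isfinite(X_train)] = np.nan
    X_test[~np.isfinite(X_test)] = np.nan

    # Medians come from the training split only, so no test data leaks in.
    medians = np.nanmedian(X_train, axis=0)
    X_train = np.where(np.isnan(X_train), medians, X_train)
    X_test = np.where(np.isnan(X_test), medians, X_test)

    # Fit the scaler on train, then apply the same transform to test.
    scaler = StandardScaler().fit(X_train)
    return scaler.transform(X_train), scaler.transform(X_test), scaler


# Toy arrays, made up for illustration (not data from this report).
X_tr = np.array([[1.0, np.inf], [3.0, 2.0], [np.nan, 4.0]])
X_te = np.array([[2.0, np.nan]])
X_tr_s, X_te_s, fitted_scaler = prepare_features(X_tr, X_te)
print(X_te_s)
```

Both the medians and the scaler statistics are computed on the training split and reused for the test split, which is what "fit on train, applied to test" describes.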
Baseline (Most-Frequent) Accuracy: 0.9095
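The baseline value can be reproduced from class counts alone. This sketch assumes the baseline predicts the majority training class for every test sample. The test counts are taken from the support column of the classification reports later in this document.

```python
# Train class counts (from the class distribution) and test class counts
# (from the "support" column of the classification reports).
train_counts = {0: 8704, 1: 857}
test_counts = {0: 744, 1: 74}

# A most-frequent baseline always predicts the majority training class.
majority_class = max(train_counts, key=train_counts.get)

# Its test accuracy is simply the share of test samples in that class.
baseline_accuracy = test_counts[majority_class] / sum(test_counts.values())
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
```

Every accuracy in the model comparison tables falls below 0.9095, so accuracy alone says little here. The imbalance-aware columns (balanced accuracy, recall, F1, PR-AUC) are more informative.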
EDR: Model Performance Comparison
EDR – Model Performance Metrics
| Model | Accuracy | Balanced Acc | Precision | Recall | F1 | ROC-AUC | PR-AUC |
|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.8924 | 0.5819 | 0.3409 | 0.2027 | 0.2542 | 0.6143 | 0.2095 |
| Random Forest (SMOTE) | 0.9010 | 0.5927 | 0.4103 | 0.2162 | 0.2832 | 0.7418 | 0.2957 |
| LightGBM | 0.8900 | 0.6170 | 0.3621 | 0.2838 | 0.3182 | 0.8253 | 0.3332 |
| Balanced RF | 0.8631 | 0.6874 | 0.3241 | 0.4730 | 0.3846 | 0.8345 | 0.3140 |
| SGD SVM | 0.8851 | 0.5717 | 0.2917 | 0.1892 | 0.2295 | n/a | n/a |
| IsolationForest | 0.3900 | 0.4821 | 0.0858 | 0.5946 | 0.1499 | n/a | n/a |
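The threshold-based columns (Accuracy through F1) can be recomputed from the four confusion-matrix counts in the confusion matrix table for this section. The sketch below is one way to do that. `metrics_from_counts` is a hypothetical helper, not a function from the pipeline.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Threshold-based metrics for the positive class from confusion counts."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "balanced_acc": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# EDR Logistic Regression counts: TN=715, FP=29, FN=59, TP=15.
lr_metrics = metrics_from_counts(tn=715, fp=29, fn=59, tp=15)
for name, value in lr_metrics.items():
    print(f"{name}: {value:.4f}")
```

Rounded to four decimals, the output reproduces the Logistic Regression row. ROC-AUC and PR-AUC cannot be derived this way because they need continuous scores rather than hard predictions. That is presumably why they are missing for SGD SVM and IsolationForest.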
Confusion Matrix Analysis
| Model | TN | FP | FN | TP | FP Rate | Miss Rate |
|---|---|---|---|---|---|---|
| Logistic Regression | 715 | 29 | 59 | 15 | 3.90% | 79.73% |
| Random Forest (SMOTE) | 721 | 23 | 58 | 16 | 3.09% | 78.38% |
| LightGBM | 707 | 37 | 53 | 21 | 4.97% | 71.62% |
| Balanced RF | 671 | 73 | 39 | 35 | 9.81% | 52.70% |
| SGD SVM | 710 | 34 | 60 | 14 | 4.57% | 81.08% |
| IsolationForest | 275 | 469 | 30 | 44 | 63.04% | 40.54% |
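The two rate columns are ratios within each true class: FP Rate = FP / (FP + TN) and Miss Rate = FN / (FN + TP). A short sketch that recomputes them from the counts in the table (the dictionary layout is an illustrative choice):

```python
# (TN, FP, FN, TP) per model, copied from the EDR confusion matrix table.
edr_confusion = {
    "Logistic Regression": (715, 29, 59, 15),
    "Random Forest (SMOTE)": (721, 23, 58, 16),
    "LightGBM": (707, 37, 53, 21),
    "Balanced RF": (671, 73, 39, 35),
    "SGD SVM": (710, 34, 60, 14),
    "IsolationForest": (275, 469, 30, 44),
}


def error_rates(tn, fp, fn, tp):
    """Return (false positive rate, miss rate) for one confusion matrix."""
    fp_rate = fp / (fp + tn)    # share of true negatives flagged as positive
    miss_rate = fn / (fn + tp)  # share of true positives that were missed
    return fp_rate, miss_rate


edr_rates = {model: error_rates(*counts) for model, counts in edr_confusion.items()}
for model, (fp_rate, miss_rate) in edr_rates.items():
    print(f"{model:<22} FP rate {fp_rate:7.2%}   miss rate {miss_rate:7.2%}")
```

Miss Rate is 1 minus Recall, so the two columns show the same trade-off as the Recall column of the metrics table: models that miss fewer positives raise more false alarms.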
Best Models by Metric
| Metric | Best Model | Value |
|---|---|---|
| Accuracy | Random Forest (SMOTE) | 0.9010 |
| Balanced Acc | Balanced RF | 0.6874 |
| Precision | Random Forest (SMOTE) | 0.4103 |
| Recall | IsolationForest | 0.5946 |
| F1 | Balanced RF | 0.3846 |
| ROC-AUC | Balanced RF | 0.8345 |
| PR-AUC | LightGBM | 0.3332 |
| Lowest False Positive Rate | Random Forest (SMOTE) | 3.09% |
| Lowest Miss Rate | IsolationForest | 40.54% |
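A per-metric winner list like this can be derived mechanically from the metrics table. The sketch below uses a subset of the EDR columns and assumes two conventions: models with a missing value are skipped for that metric, and rate metrics are minimized rather than maximized. `best_by_metric` is a hypothetical helper name.

```python
import math

NAN = float("nan")

# Subset of the EDR results; FP Rate is in percent.
edr_results = {
    "Logistic Regression": {"Accuracy": 0.8924, "Recall": 0.2027, "ROC-AUC": 0.6143, "FP Rate": 3.90},
    "Random Forest (SMOTE)": {"Accuracy": 0.9010, "Recall": 0.2162, "ROC-AUC": 0.7418, "FP Rate": 3.09},
    "LightGBM": {"Accuracy": 0.8900, "Recall": 0.2838, "ROC-AUC": 0.8253, "FP Rate": 4.97},
    "Balanced RF": {"Accuracy": 0.8631, "Recall": 0.4730, "ROC-AUC": 0.8345, "FP Rate": 9.81},
    "SGD SVM": {"Accuracy": 0.8851, "Recall": 0.1892, "ROC-AUC": NAN, "FP Rate": 4.57},
    "IsolationForest": {"Accuracy": 0.3900, "Recall": 0.5946, "ROC-AUC": NAN, "FP Rate": 63.04},
}

# True means higher is better; False means lower is better.
HIGHER_IS_BETTER = {"Accuracy": True, "Recall": True, "ROC-AUC": True, "FP Rate": False}


def best_by_metric(results, higher_is_better):
    """Return {metric: (model, value)}, ignoring models with a NaN value."""
    best = {}
    for metric, maximize in higher_is_better.items():
        candidates = {
            model: scores[metric]
            for model, scores in results.items()
            if not math.isnan(scores[metric])
        }
        pick = max if maximize else min
        winner = pick(candidates, key=candidates.get)
        best[metric] = (winner, candidates[winner])
    return best


edr_best = best_by_metric(edr_results, HIGHER_IS_BETTER)
for metric, (model, value) in edr_best.items():
    print(f"{metric}: {model} ({value})")
```

A winner on a single metric can still be a poor model overall. IsolationForest has the highest recall here but also a 63.04% false positive rate.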
EDR – Metrics by Model
EDR – ROC Curves
EDR – Precision–Recall Curves
EDR – Predicted Probability Distributions
EDR – Threshold Sweep
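A threshold sweep re-scores the same predicted probabilities at several decision thresholds and records how precision and recall trade off. The plot data is not included in this export, so the sketch below runs on made-up toy scores. `sweep_thresholds` is a hypothetical helper name.

```python
def sweep_thresholds(y_true, scores, thresholds):
    """Compute precision, recall and F1 of the positive class per threshold."""
    rows = []
    for threshold in thresholds:
        # A sample is predicted positive when its score reaches the threshold.
        preds = [int(score >= threshold) for score in scores]
        tp = sum(1 for p, y in zip(preds, y_true) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(preds, y_true) if p == 1 and y == 0)
        fn = sum(1 for p, y in zip(preds, y_true) if p == 0 and y == 1)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        rows.append({"threshold": threshold, "precision": precision,
                     "recall": recall, "f1": f1})
    return rows


# Toy labels and scores, invented for illustration (not data from this report).
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.90]

sweep = sweep_thresholds(y_true, scores, thresholds=[0.3, 0.5, 0.7])
for row in sweep:
    print(row)
```

On this toy data, raising the threshold from 0.3 to 0.7 lifts precision from 0.6 to 1.0 while recall drops from 1.0 to about 0.67.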
EDR: Logistic Regression – Detailed Analysis
EDR – Logistic Regression: Confusion Matrix
EDR – Logistic Regression: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9238 | 0.9610 | 0.9420 | 744 |
| 1 | 0.3409 | 0.2027 | 0.2542 | 74 |
| accuracy | n/a | n/a | 0.8924 | 818 |
EDR – Logistic Regression: Feature Importance
EDR: Random Forest (SMOTE) – Detailed Analysis
EDR – Random Forest (SMOTE): Confusion Matrix
EDR – Random Forest (SMOTE): Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9255 | 0.9691 | 0.9468 | 744 |
| 1 | 0.4103 | 0.2162 | 0.2832 | 74 |
| accuracy | n/a | n/a | 0.9010 | 818 |
EDR – Random Forest (SMOTE): Feature Importance
EDR: LightGBM – Detailed Analysis
EDR – LightGBM: Confusion Matrix
EDR – LightGBM: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9303 | 0.9503 | 0.9402 | 744 |
| 1 | 0.3621 | 0.2838 | 0.3182 | 74 |
| accuracy | n/a | n/a | 0.8900 | 818 |
EDR – LightGBM: Feature Importance
EDR: Balanced RF – Detailed Analysis
EDR – Balanced RF: Confusion Matrix
EDR – Balanced RF: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9451 | 0.9019 | 0.9230 | 744 |
| 1 | 0.3241 | 0.4730 | 0.3846 | 74 |
| accuracy | n/a | n/a | 0.8631 | 818 |
EDR – Balanced RF: Feature Importance
EDR: SGD SVM – Detailed Analysis
EDR – SGD SVM: Confusion Matrix
EDR – SGD SVM: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9221 | 0.9543 | 0.9379 | 744 |
| 1 | 0.2917 | 0.1892 | 0.2295 | 74 |
| accuracy | n/a | n/a | 0.8851 | 818 |
EDR – SGD SVM: Feature Importance
EDR: IsolationForest – Detailed Analysis
EDR – IsolationForest: Confusion Matrix
EDR – IsolationForest: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9016 | 0.3696 | 0.5243 | 744 |
| 1 | 0.0858 | 0.5946 | 0.1499 | 74 |
| accuracy | n/a | n/a | 0.3900 | 818 |
EDR – IsolationForest: Feature Importance
Feature importance not available for this model type.
XDR: Dataset Loading & Preprocessing
XDR – Train/Test Overview
• Train shape: (9561, 34) | Test shape: (818, 34)
• Total train samples: 9,561 | Total test samples: 818
• Number of features: 32
• Target column: 'label'
• Missing values (train): 0 | (test): 0
XDR – Train Class Distribution
• 0: 8,704
• 1: 857
• Class balance (minority/majority): 9.8460%
XDR – Feature Preparation
• Target encoding: {0: 0, 1: 1}
• Data preprocessing: Infinite values handled, missing values filled with train medians
• Feature scaling: StandardScaler (fit on train, applied to test)
Baseline (Most-Frequent) Accuracy: 0.9095
XDR: Model Performance Comparison
XDR – Model Performance Metrics
| Model | Accuracy | Balanced Acc | Precision | Recall | F1 | ROC-AUC | PR-AUC |
|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.5232 | 0.5432 | 0.1050 | 0.5676 | 0.1772 | 0.5977 | 0.1980 |
| Random Forest (SMOTE) | 0.9046 | 0.5642 | 0.4231 | 0.1486 | 0.2200 | 0.7342 | 0.3067 |
| LightGBM | 0.9022 | 0.5994 | 0.4250 | 0.2297 | 0.2982 | 0.8470 | 0.3763 |
| Balanced RF | 0.8680 | 0.6840 | 0.3333 | 0.4595 | 0.3864 | 0.8263 | 0.2972 |
| SGD SVM | 0.1198 | 0.5040 | 0.0911 | 0.9730 | 0.1667 | n/a | n/a |
| IsolationForest | 0.8301 | 0.5537 | 0.1649 | 0.2162 | 0.1871 | n/a | n/a |
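Both feature sets are evaluated on the same 818 test samples, so the two metrics tables can be compared model by model. A small sketch that computes XDR minus EDR for F1 and ROC-AUC, with missing values carried through as `None` (the tuple layout is an illustrative choice):

```python
# (F1, ROC-AUC) per model from the EDR and XDR metrics tables.
# None marks a value that is not available in the report.
edr_scores = {
    "Logistic Regression": (0.2542, 0.6143),
    "Random Forest (SMOTE)": (0.2832, 0.7418),
    "LightGBM": (0.3182, 0.8253),
    "Balanced RF": (0.3846, 0.8345),
    "SGD SVM": (0.2295, None),
    "IsolationForest": (0.1499, None),
}
xdr_scores = {
    "Logistic Regression": (0.1772, 0.5977),
    "Random Forest (SMOTE)": (0.2200, 0.7342),
    "LightGBM": (0.2982, 0.8470),
    "Balanced RF": (0.3864, 0.8263),
    "SGD SVM": (0.1667, None),
    "IsolationForest": (0.1871, None),
}


def delta(new, old):
    """Difference new - old, or None when either side is unavailable."""
    if new is None or old is None:
        return None
    return round(new - old, 4)


# Per model: (F1 delta, ROC-AUC delta), positive when XDR scores higher.
deltas = {
    model: tuple(delta(x, e) for x, e in zip(xdr_scores[model], edr_scores[model]))
    for model in edr_scores
}
for model, (d_f1, d_auc) in deltas.items():
    print(f"{model:<22} F1 delta {d_f1}   ROC-AUC delta {d_auc}")
```

With only 74 positive test samples, small differences in either direction may not be meaningful.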
Confusion Matrix Analysis
| Model | TN | FP | FN | TP | FP Rate | Miss Rate |
|---|---|---|---|---|---|---|
| Logistic Regression | 386 | 358 | 32 | 42 | 48.12% | 43.24% |
| Random Forest (SMOTE) | 729 | 15 | 63 | 11 | 2.02% | 85.14% |
| LightGBM | 721 | 23 | 57 | 17 | 3.09% | 77.03% |
| Balanced RF | 676 | 68 | 40 | 34 | 9.14% | 54.05% |
| SGD SVM | 26 | 718 | 2 | 72 | 96.51% | 2.70% |
| IsolationForest | 663 | 81 | 58 | 16 | 10.89% | 78.38% |
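A very low miss rate paired with a very high FP rate usually means the model labels almost everything as positive. One way to see this is to compare each model's share of flagged samples, (FP + TP) / total, with the true positive prevalence in the test set. A sketch using the XDR counts above (`flag_rate` is a hypothetical helper name):

```python
# (TN, FP, FN, TP) per model, copied from the XDR confusion matrix table.
xdr_confusion = {
    "Logistic Regression": (386, 358, 32, 42),
    "Random Forest (SMOTE)": (729, 15, 63, 11),
    "LightGBM": (721, 23, 57, 17),
    "Balanced RF": (676, 68, 40, 34),
    "SGD SVM": (26, 718, 2, 72),
    "IsolationForest": (663, 81, 58, 16),
}


def flag_rate(tn, fp, fn, tp):
    """Share of all test samples that the model predicts as positive."""
    return (fp + tp) / (tn + fp + fn + tp)


# True share of positives in the test set (74 of 818 samples).
prevalence = 74 / 818

xdr_flag_rates = {model: flag_rate(*counts) for model, counts in xdr_confusion.items()}
print(f"True positive prevalence: {prevalence:.2%}")
for model, rate in xdr_flag_rates.items():
    print(f"{model:<22} flags {rate:7.2%} of test samples")
```

SGD SVM flags 790 of the 818 test samples, which explains its 2.70% miss rate alongside 0.1198 accuracy.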
Best Models by Metric
Accuracy
Random Forest (SMOTE)
0.9046
Balanced Acc
Balanced RF
0.6840
Precision
LightGBM
0.4250
Recall
SGD SVM
0.9730
F1
Balanced RF
0.3864
ROC-AUC
LightGBM
0.8470
PR-AUC
LightGBM
0.3763
Lowest False Positive Rate
Random Forest (SMOTE)
2.02%
Lowest Miss Rate
SGD SVM
2.70%
XDR – Metrics by Model
XDR – ROC Curves
XDR – Precision–Recall Curves
XDR – Predicted Probability Distributions
XDR – Threshold Sweep
XDR: Logistic Regression – Detailed Analysis
XDR – Logistic Regression: Confusion Matrix
XDR – Logistic Regression: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9234 | 0.5188 | 0.6644 | 744 |
| 1 | 0.1050 | 0.5676 | 0.1772 | 74 |
| accuracy | n/a | n/a | 0.5232 | 818 |
XDR – Logistic Regression: Feature Importance
XDR: Random Forest (SMOTE) – Detailed Analysis
XDR – Random Forest (SMOTE): Confusion Matrix
XDR – Random Forest (SMOTE): Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9205 | 0.9798 | 0.9492 | 744 |
| 1 | 0.4231 | 0.1486 | 0.2200 | 74 |
| accuracy | n/a | n/a | 0.9046 | 818 |
XDR – Random Forest (SMOTE): Feature Importance
XDR: LightGBM – Detailed Analysis
XDR – LightGBM: Confusion Matrix
XDR – LightGBM: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9267 | 0.9691 | 0.9474 | 744 |
| 1 | 0.4250 | 0.2297 | 0.2982 | 74 |
| accuracy | n/a | n/a | 0.9022 | 818 |
XDR – LightGBM: Feature Importance
XDR: Balanced RF – Detailed Analysis
XDR – Balanced RF: Confusion Matrix
XDR – Balanced RF: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9441 | 0.9086 | 0.9260 | 744 |
| 1 | 0.3333 | 0.4595 | 0.3864 | 74 |
| accuracy | n/a | n/a | 0.8680 | 818 |
XDR – Balanced RF: Feature Importance
XDR: SGD SVM – Detailed Analysis
XDR – SGD SVM: Confusion Matrix
XDR – SGD SVM: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9286 | 0.0349 | 0.0674 | 744 |
| 1 | 0.0911 | 0.9730 | 0.1667 | 74 |
| accuracy | n/a | n/a | 0.1198 | 818 |
XDR – SGD SVM: Feature Importance
XDR: IsolationForest – Detailed Analysis
XDR – IsolationForest: Confusion Matrix
XDR – IsolationForest: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9196 | 0.8911 | 0.9051 | 744 |
| 1 | 0.1649 | 0.2162 | 0.1871 | 74 |
| accuracy | n/a | n/a | 0.8301 | 818 |
XDR – IsolationForest: Feature Importance
Feature importance not available for this model type.